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Abstract 

Sparse coding is a basic task in many fields including signal processing, neuroscience and 
machine learning where the goal is to learn a basis that enables a sparse representation of 
a given set of data, if one exists. Its standard formulation is as a non-convex optimization 
problem which is solved in practice by heuristics based on alternating minimization. Re¬ 
cent work has resulted in several algorithms for sparse coding with provable guarantees, 
but somewhat surprisingly these are outperformed by the simple alternating minimization 
heuristics. Here we give a general framework for understanding alternating minimization 
which we leverage to analyze existing heuristics and to design new ones also with provable 
guarantees. Some of these algorithms seem implementable on simple neural architectures, 
which was the original motivation of Olshausen and Field (1997a) in introducing sparse 
coding. We also give the first efficient algorithm for sparse coding that works almost up to 
the information theoretic limit for sparse recovery on incoherent dictionaries. All previous 
algorithms that approached or surpassed this limit run in time exponential in some natural 
parameter. Finally, our algorithms improve upon the sample complexity of existing ap¬ 
proaches. We believe that our analysis framework will have applications in other settings 
where simple iterative algorithms are used. 

1. Introduction 

Sparse coding or dictionary learning consists of learning to express (i.e., code ) a set of input 
vectors, say image patches, as linear combinations of a small number of vectors chosen 
from a large dictionary. It is a basic task in many fields. In signal processing, a wide 
variety of signals turn out to be sparse in an appropriately chosen basis (see references 
in Mallat (1998)). In neuroscience, sparse representations are believed to improve energy 
efficiency of the brain by allowing most neurons to be inactive at any given time. In machine 
learning, imposing sparsity as a constraint on the representation is a useful way to avoid 
over-fitting. Additionally, methods for sparse coding can be thought of as a tool for feature 
extraction and are the basis for a number of important tasks in image processing such as 
segmentation, retrieval, de-noising and super-resolution (see references in Elad (2010)), as 
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well as a building block for some deep learning architectures Ranzato et al. (2007). It is 
also a basic problem in linear algebra itself since it involves finding a better basis. 

The notion was introduced by neuroscientists Olshausen and Field (1997a) who formal¬ 
ized it as follows: Given a dataset y^\ y( 2 \ ..., y ^ E fi n , our goal is to find a set of basis 
vectors A2,..., A m E and sparse coefficient vectors x^\ x^ 2 ',..., E that 
minimize the reconstruction error 

Ell9 <i) -^-x <i) ||! + i>(x< i) ) f 1 ) 

2=1 2=1 

where A is the nxm coding matrix whose jth column is Aj and S(-) is a nonlinear penalty 
function that is used to encourage sparsity. This function is nonconvex because both A and 
the x®’s are unknown. Their paper as well as subsequent work chooses rn to be larger 
than n (so-called overcomplete case) because this allows greater flexibility in adapting the 
representation to the data. We remark that sparse coding should not be confused with the 
related — and usually easier — problem of finding the sparse representations of the yW’s 
given the coding matrix A, variously called compressed sensing or sparse recovery Candes 
et al. (2006); Candes and Tao (2005). 

Olshausen and Field also gave a local search/gradient descent heuristic for trying to min¬ 
imize the nonconvex energy function (1). They gave experimental evidence that it produces 
coding matrices for image patches that resemble known features (such as Gabor filters) in 
V 1 portion of the visual cortex. A related paper of the same authors Olshausen and Field 
(1997b) (and also Lewicki and Sejnowski (2000)) places sparse coding in a more familiar 
generative model setting whereby the data points yW’s are assumed to be probabilistically 
generated according to a model y ® = A* ■ x*^ + noise where x*^ l \ x*^ 2 \ ..., x are sam¬ 
ples from some appropriate distribution and A* is an unknown code. Then one can define 
the maximum likelihood estimate, and this leads to a different and usually more complicated 
energy function — and associated heuristics — compared to (1). 

Surprisingly, maximum likelihood-based approaches seem unnecessary in practice and 
local search/gradient descent on the energy function (1) with hard constraints works well, 
as do related algorithms such as MOD Aharon et al. (2006) and fc-SVD Engan et al. (1999). 
In fact these methods are so effective that sparse coding is considered in practice to be a 
solved problem, even though it has no polynomial time algorithm per se. 

Efficient Algorithms vs Neural Algorithms. Recently, there has been rapid progress 
on designing polynomial time algorithms for sparse coding with provable guarantees (the 
relevant papers are discussed below). All of these adopt the generative model viewpoint 
sketched above. But the surprising success of the simple descent heuristics has remained 
largely unexplained. Empirically, these heuristics far out perform — in running time, sample 
complexity, and solution quality — the new algorithms, and this (startling) observation was 
in fact the starting point for the current work. 

Of course, the famous example of simplex v s ellipsoid for linear programming reminds us 
that it can be much more challenging to analyze the behavior of an empirically successful 
algorithm than it is to to design a new polynomial time algorithm from scratch! But 
for sparse coding the simple intuitive heuristics are important for another reason beyond 
just their algorithmic efficiency: they appear to be implementable in neural architectures. 
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(Roughly speaking, this means that the algorithm stores the code matrix A as synapse 
weights in a neural network and updates the entries using differences in potentials of the 
synapse’s endpoints.) Since neural computation — and also deep learning — have proven to 
be difficult to analyze in general, analyzing sparse coding thoroughly seems to be a natural 
first step for theory. Our algorithm is a close relative of the Olshausen-Field algorithm and 
thus inherits its neural implementability; see Appendix E for further discussion. 

Here we present a rigorous analysis of the simple energy minimization heuristic, and as 
a side benefit this yields bounds on running time and sample complexity for sparse coding 
that are better (in some cases, dramatically so) than the algorithms in recent papers. This 
adds to the recent literature on analyzing alternating minimization Jain et al. (2013); Hardt 
(2013); Netrapalli et al. (2013, 2014) but these work in a setting where there is a convex 
program that is known to work too, and in our setting, the only known convex program 
runs in time exponential in a natural parameter Barak et al. (2014). 

1.1. Recent Work 

A common thread in recent work on sparse coding is to assume a generative model; the 
precise details vary, but each has the property that given enough samples the solution is 
essentially unique. Spielman et al. (2012) gave an algorithm that succeeds when A* has full 
column rank (in particular m < n ) which works up to sparsity roughly yfn. However this 
algorithm is not applicable in the more prevalent overcomplete setting. Arora et al. (2014) 
and Agarwal et al. (2013, 2014) independently gave algorithms in the overcomplete case 
assuming that A* is /x-incoherent (which we define in the next section). The former gave an 
algorithm that works up to sparsity n 1 / 2-7 /// for any 7 > 0 but the running time is n 0 ^ 1 * / 7 ); 
Agarwal et al. (2013, 2014) gave an algorithm that works up to sparsity either n 1//4 /^x or 
n 1 / 6 /// depending on the particular assumptions on the model. These works also analyze 
alternating minimization but assume that it starts from an estimate A that is column-wise 
l/poly(n)-close to A*, in which case the objective function is essentially convex. 

Barak et al. (2014) gave a new approach based on the sum-of-squares hierarchy that 
works for sparsity up to n 1-7 for any 7 > 0. But in order to output an estimate that is 
column-wise e-close to A* the running time of the algorithm is n l ^ e ° W . In most applications, 
one needs to set (say) e = l/k in order to get a useful estimate. However in this case 
their algorithm runs in exponential time. The sample complexity of the above algorithms 
is also rather large, and is at least Q(m 2 ) if not much larger. Here we will give simple 
and more efficient algorithms based on alternating minimization whose column-wise error 
decreases geometrically, and that work for sparsity up to n 1 / 2 //zlogn. We remark that even 
empirically, alternating minimization does not appear to work much beyond this bound. 

1.2. Model, Notation and Results 

We will work with the following family of generative models (similar to those in earlier 
papers) 1 : 

1. The casual reader should just think of x* as being drawn from some distribution that has independent 

coordinates. Even in this simpler setting —which has polynomial time algorithms using Independent 

Component Analysis —we do not know of any rigorous analysis of heuristics like Olshausen-Field. The 
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Our Model Each sample is generated as y = A*x* + noise where A* is a ground truth 
dictionary and x* is drawn from an unknown distribution V where 

(1) the support S = supp(x*) is of size at most k, Pr[* £ S] = Q(k/m) and Pr [i,j £ 
S'] = Q(k 2 /m 2 ) 

(2) the distribution is normalized so that E[x*|:r* ^ 0] = 0; E,[x* 2 \x* / 0] = 1 and when 
x* 7 ^ 0 , |x*| > C for some constant C < 1 and 

(3) the non-zero entries are pairwise independent and subgaussian, conditioned on the 
support. 

(4) The noise is Gaussian and independent across coordinates. 

Such models are natural since the original motivation behind sparse coding was to discover a 
code whose representations have the property that the coordinates are almost independent. 
We can relax most of the requirements above, at the expense of further restricting the 
sparsity, but will not detail such tradeoffs. 

The rest of the paper ignores the iid noise: it has little effect on our basic steps like 
computing inner products of samples or taking singular vectors, and easily tolerated so long 
as it stays smaller than the “signal.” 

We assume A* is an incoherent dictionary, since these are widespread in signal processing 
Elad (2010) and statistics Donoho and Huo (1999), and include various families of wavelets, 
Gabor filters as well as randomly generated dictionaries. 

Definition 1 An n x m matrix A whose columns are unit vectors is fa-incoherent if for all 
i / j we have (Ai,Aj) < fi/y/n. 

We also require that \\A* || = 0(^/m/n). However this can be relaxed within poly logarithmic 
factors by tightening the bound on the sparsity by the same factor. Throughout this paper 
we will say that A s is (<5, /t)-near to A* if after a permutation and sign flips its columns are 
within distance <5 and we have lid . 5 — A*|| < k||A*||. See also Definition 8 . We will use this 
notion to measure the progress of our algorithms. Moreover we will use g{n ) = 0*(f(n)) to 
signify that g{n) is upper bounded by Cf(n) for some small enough constant C. Finally, 
throughout this paper we will assume that k < O* (y/n/ /rlogn) and m = 0(n). Again, m 
can be allowed to be higher by lowering the sparsity. We assume all these conditions in our 
main theorems. 

Main Theorems In Section 2 we give a general framework for analyzing alternating 
minimization. Instead of thinking of the algorithm as trying to minimize a known non- 
convex function, we view it as trying to minimize an unknown convex function. Various 
update rules are shown to provide good approximations to the gradient of the unknown 
function. See Lemma 11, Lemma 28 and Lemma 33 for examples. We then leverage 
our framework to analyze existing heuristics and to design new ones also with provable 
guarantees. In Section 3, we prove: 

earlier papers were only interested in polynomial-time algorithms, so did not wish to assume indepen¬ 
dence. 
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Theorem 2 There is a neurally plausible algorithm which when initialized with an esti¬ 
mate A 0 that is ( 6 ,2 )-near to A* for 5 = 0*(l/logn), converges at a geometric rate to A* 
until the column-wise error is 0(y/k/ri). Furthermore the running time is 0{mnp) and the 
sample complexity is p = 0(mk) for each step. 

Additionally we give a neural architecture implementing our algorithm in Appendix E. To 
the best of our knowledge, this is the first neurally plausible algorithm for sparse coding 
with provable convergence. 

Having set up our general framework and analysis technique we can use it on other 
variants of alternating minimization. Section 3.1.2 gives a new update rule whose bias (i.e., 
error) is negligible: 

Theorem 3 There is an algorithm which when initialized with an estimate A 0 that is (5, 2)- 
near to A* for 5 = 0*(l/logn), converges at a geometric rate to A* until the column-wise 
error is Furthermore each step runs in time 0(mnp) and the sample complexity 

p is polynomial. 

This algorithm is based on a modification where we carefully project out components along 
the column currently being updated. We complement the above theorems by revisiting the 
Olshausen-Field rule and analyzing a variant of it in Section 3.1.1 (Theorem 12). However 
its analysis is more complex because we need to bound some quadratic error terms. It uses 
convex programming. 

What remains is to give a method to initialize these iterative algorithms. We give a new 
approach based on pair-wise reweighting and we prove that it returns an estimate A 0 that 
is (5, 2)-near to A* for 5 = 0*(l/logn) with high probability. As an additional benefit, 
this algorithm can be used even in settings where m is not known and this could help solve 
another problem in practice — that of model selection. In Section 5 we prove: 

Theorem 4 There is an algorithm which returns an estimate A 0 that is (5, 2) -near to A* 
for S = 0*(l/logn). Furthermore the running time is 0(mn 2 p) and the sample complexity 
p = 0(mk). 

This algorithm also admits a neural implementation, which is sketched in Appendix E. The 
proof currently requires a projection step that increases the run time though we suspect it 
is not needed. 

We remark that these algorithms work up to sparsity O* (y/n/ /rlogn) which is within a 
logarithmic factor of the information theoretic threshold for sparse recovery on incoherent 
dictionaries Donoho and Huo (1999); Gribonval and Nielsen (2003). All previous known 
algorithms that approach Arora et al. (2014) or surpass this sparsity Barak et al. (2014) 
run in time exponential in some natural parameter. Moreover, our algorithms are simple to 
describe and implement, and involve only basic operations. We believe that our framework 
will have applications beyond sparse coding, and could be used to show that simple, iterative 
algorithms can be powerful in other contexts as well by suggesting new ways to analyze them. 
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Algorithm 1 Generic Alternating Minimization Approach 

Initialize A 0 
Repeat for s = 0,1, T 

Decode: Find a sparse solution to A s x ^ = y® for i = 1,2, ...,p 
Set X s such that its columns are for i = 1,2, ...,p 
Update: A s+1 = A s — gg s where g s is the gradient of £(A S , X s ) with respect to A s 


2. Our Framework, and an Overview 

Here we describe our framework for analyzing alternating minimization. The generic scheme 
we will be interested in is given in Algorithm 1 and it alternates between updating the 
estimates A and X. It is a heuristic for minimizing the non-convex function in (1) where 
the penalty function is a hard constraint. The crucial step is if we fix X and compute the 
gradient of (1) with respect to A, we get: 

v 

X a £(A,X) = -2(y (i) - Ax«)(x«) T . 

2=1 

We then take a step in the opposite direction to update A. Here and throughout the 
paper g is the learning rate, and needs to be set appropriately. The challenge in analyzing 
this general algorithm is to identify a suitable “measure of progress”— called a Lyapunov 
function in dynamical systems and control theory — and show that it improves at each step 
(with high probability over the samples). We will measure the progress of our algorithms 
by the maximum column-wise difference between A and A*. 

In the next subsection, we identify sufficient conditions that guarantee progress. They 
are inspired by proofs in convex optimization. We view Algorithm 1 as trying to minimize 
an unknown convex function, specifically /(A) = £(A,X*), which is strictly convex and 
hence has a unique optimum that can be reached via gradient descent. This function is 
unknown since the algorithm does not know X*. The analysis will show that the direction 
of movement is correlated with A* — A s , which in turn is the gradient of the above func¬ 
tion. An independent paper of Balakrishnan et al. (2014) proposes a similar framework for 
analysing EM algorithms for hidden variable models. The difference is that their condition 
is really about the geometry of the objective function, though ours is about the property 
of the direction of movement. Therefore we have the flexibility to choose different decoding 
procedures. This flexibility allows us to have a closed form of X s and obtain a useful func¬ 
tional form of g s . The setup is reminiscent of stochastic gradient descent , which moves in a 
direction whose expectation is the gradient of a known convex function. By contrast, here 
the function /() is unknown, and furthermore the expectation of g s is not the true gradient 
and has bias. Due to the bias, we will only be able to prove that our algorithms reach an 
approximate optimum up to some error whose magnitude is determined by the bias. We 
can make the bias negligible using more complicated algorithms. 
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Approximate Gradient Descent 

Consider a general iterative algorithm that is trying to get to a desired solution z* (in our 
case z* = A* for some i). At step s it starts with a guess z s , computes some direction g s , 
and updates its estimate as: z s+1 = z s — gg s . The natural progress measure is || z* — z s || 2 , 
and below we will identify a sufficient condition for it to decrease in each step: 

Definition 5 A vector g s is (a, (3, e s )-correlated with z* if 

(g s ,z s - z*) > a||z s - z*\\ 2 + j3\\g s \\ 2 - e s . 

Remark: The traditional analysis of convex optimization corresponds to the setting where 
z* is the global optimum of some convex function /, and e s = 0. Specifically, if /(•) is 
2 a-strongly convex and l/(2/3)-smooth, then g s = V f(z s ) (a,/3, 0 )-correlated with z*. Also 
we will refer to e s as the bias. 

Theorem 6 Suppose g s satisfies Definition 5 for s = 1,2,..., T, and g satisfies 0 < g < 2/5 
and e = maxj =1 e s . Then for any s = 1,... ,T, 

\\z a+1 _ z *||2 < _ 2arj)\\z s - z*|| 2 + 2ge s 

In particular, the update rule above converges to z* geometrically with systematic error e/a 
in the sense that 

||2 s -z*|| 2 < (1 — 2ag) s \\z° — z*\\ 2 + e/a. 

Furthermore, if e s < ^|| z s — z* || 2 for s = 1,... ,T, then 

H^-zl 2 < (1 -ar]) s \\z 0 - z*\\ 2 . 

The proof closely follows existing proofs in convex optimization: 

Proof: [Proof of Theorem 6 ] We expand the error as 

\\z s+1 — z* || 2 = ||z s — z*\\ 2 — 2 rjg sT (z s — z*) + r 7 2 ||g s || 2 

= || 2 S — z*\\ 2 — rj (2 g sT (z s — z*) — ??||< 7 S |[ 2 ) 

< || z s — z*\\ 2 — g (2a||^ s — z* || 2 + (2/3 — f/)||s ,s || 2 — 2e s ) (Definition 5 and ?? < 2/3) 

< ||z s - z*|| 2 - g (2a\\z s - z* || 2 - 2e s ) 

< (1 — 2ag)\\z s — z *|| 2 + 2ge s 

Then solving this recurrence we have ||z s+1 — z*\\ 2 < (\—2ag) s+l R 2 + — where R = \\z° — z*\\. 

And furthermore if e s < 3(\\z s — z *|| 2 we have instead 

||z s+1 — 2 :*|| 2 < (1 — 2ag)\\z s — 2 :|| 2 + ag\\z s — z|| 2 = (1 — ag)\\z s — ^|| 2 

and this yields the second part of the theorem too. ■ 

In fact, we can extend the analysis above to obtain identical results for the case of 
constrained optimization. Suppose we are interested in optimizing a convex function f(z) 
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over a convex set B. The standard approach is to take a step in the direction of the gradient 
(or g s in our case) and then project into B after each iteration, namely, replace 2 S+1 by 
ProjgZ s+1 which is the closest point in B to z s+1 in Euclidean distance. It is well-known 
that if z* £ B, then ||Projg z — z*\\ < || z — z*||. Therefore we obtain the following as an 
immediate corollary to the above analysis: 

Corollary 7 Suppose g s satisfies Definition 5 for s = 1,2,...,T and set 0 < g < 2/3 
and e = maxj =1 e s . Further suppose that z* lies in a convex set B. Then the update rule 
z s+ 1 = Proj B (z s — gg s ) satisfies that for any s = 1,..., T, 

Wz 8 - z*|| 2 < (1 - 2ag) s \\z° - 2 *|| 2 + e/a 

In particular, z s converges to z* geometrically with systematic error e/a. Additionally if 
e s < )||| z s — z *|| 2 for s = l,...,T, then 

||z s -z*|| 2 < (1 — ag) s \\z° — z*\\ 2 

What remains is to derive a functional form for various update rules and show that 
these rules move in a direction g s that approximately points in the direction of the desired 
solution z* (under the assumption that our data is generated from a stochastic model that 
meets certain conditions). 

An Overview of Applying Our Framework 

Our framework clarifies that any improvement step meeting Definition 5 will also converge 
to an approximate optimum, which enables us to engineer other update rules that turn 
out to be easier to analyze. Indeed we first analyze a simpler update rule with g s = 
E[(y — A s x)sgn(x r )] in Section 3. Here sgn(-) is the coordinate-wise sign function. We 
then return to the Olshausen-Field update rule and analyze a variant of it in Section 3.1.1 
using approximate projected gradient descent. Finally, we design a new update rule in 
Section 3.1.2 where we carefully project out components along the column currently being 
updated. This has the effect of replacing one error term with another and results in an 
update rule with negligible bias. The main steps in showing that these update rules fit into 
our framework are given in Lemma 11, Lemma 28 and Lemma 33. 

How should the algorithm update XI The usual approach is to solve a sparse recovery 
problem with respect to the current code matrix A. However many of the standard basis 
pursuit algorithms (such as solving a linear program with an t\ penalty) are difficult to 
analyze when there is error in the code itself. This is in part because the solution does not 
have a closed form in terms of the code matrix. Instead we take a much simpler approach 
to solving the sparse recovery problem which uses matrix-vector multiplication followed by 
thresholding: In particular, we set x = threshold;^ ((A s ) T y), where threshold^ (■) keeps 
only the coordinates whose magnitude is at least C/2 and zeros out the rest. Recall that the 
non-zero coordinates in x* have magnitude at least C. This decoding rule recovers the signs 
and support of x correctly provided that A is column-wise 5-close to A* for 5 = O* {1 / \ogn). 
See Lemma 10. 

The rest of the analysis can be described as follows: If the signs and support of x are 
recovered correctly, then alternating minimization makes progress in each step. In fact this 
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holds each for much larger values of k than we consider; as high as n/(logn)°^ 1 \ (However, 
the explicit decoding rule fails for k > y/n/fi log n.) Thus it only remains to properly 
initialize A 0 so that it is close enough to A* to let the above decoding rule succeed. In 
Section 5 we give a new initialization procedure based on pair-wise reweighting that we 
prove works with high probability. This section may be of independent interest, since this 
algorithm can be used even in settings where m is not known and could help solve another 
problem in practice — that of model selection. See Lemma 20. 

3. A Neurally Plausible Algorithm with Provable Guarantees 

Here we will design and analyze a neurally plausible algorithm for sparse coding which is 
given in Algorithm 2, and we give a neural architecture implementing our algorithm in 
Appendix E. The fact that such a simple algorithm provably works sheds new light on how 
sparse coding might be accomplished in nature. Here and throughout this paper we will 
work with the following measure of closeness: 

Definition 8 A is 5-close to A* if there is a permutation n : [m] —> [m] and a choice of 
signs a : [m] —> {±1} such that || cr(i)A w ^ — A*\\ <5 for all i We say A is (5, ft)-near to A* 
if in addition ||A — A*|| < re||A*|| too. 

This is a natural measure to use, since we can only hope to learn the columns of A* up to 
relabeling and sign-flips. In our analysis, we will assume throughout that 7r(-) is the identity 
permutation and a(-) = +1 because our family of generative models is invariant under this 
relabeling and it will simplify our notation. 

Let sgn(-) denote the coordinate-wise sign function and recall that r/ is the learning rate, 
which we will soon set. Also we fix both 5, do = 0*(l/logn). We will also assume that in 
each iteration, our algorithm is given a fresh set of p samples. Our main theorem is: 

Theorem 9 Suppose that A 0 is (25,2)-near to A* and that y = @(m/k). Then if each 
update step in Algorithm 2 uses p = H(mk) fresh samples, we have 

E [||Af - A*|| 2 ] < (1 - t) s \\A° - A*|| 2 + 0(k/n ) 

for some 0 < r < 1/2 and for any s = 1,2, ..., T. In particular it converges to A* geometri¬ 
cally, until the column-wise error is 0(y/k/n). 

Our strategy is to prove that g s is (a, /?, e)-correlated (see Definition 5) with the desired 
solution A*, and then to prove that ||A|| never gets too large. We will first prove that if A 
is somewhat close to A* then the estimate x for the representation almost always has the 
correct support. Here and elsewhere in the paper, we use “very high probability” to mean 
that an event happens with probability at least 1 — l/nM 1 ). 

Lemma 10 Suppose that A s is 5-close to A*. Then with very high probability over the 
choice of the random sample y = A*x*: 

sgn(threshold C 7/2((^ s ) T y)) = sgn(x*) 
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Algorithm 2 Neurally Plausible Update Rule 

Initialize A 0 that is (<5o,2)-near to A* 

Repeat for s = 0,1 ,T 

Decode: = thresholdc/ 2 ((^- s ) T y^) for * = 1,2, ...,p 

i p 

Update: A s+1 = A s — gg s where g s = - ■ ^^(y^ — A s x^)sgn(x^) T 

p i= i 


We prove a more general version of this lemma (Lemma 23) in Appendix A; it is an ingredient 
in analyzing all of the update rules we consider in this paper. However this is just one step 
on the way towards proving that fj s is correlated with the true solution. 

The next step in our proof is to use the properties of the generative model to derive a 
new formula for fj s that is more amenable to analysis. We define g s to be the expectation 
of g s 

g s := E[? s ] = E[(y - A s x) sgn(x) T ] (2) 

where x := thresholdcy 2 ((A s ) T y) is the decoding of y. Let qi = Pr[x* / 0] and = 
Pr[x*s* 7 ^ 0], and define pi = E[xtsgn(x*)|x* ^ 0]. 

Here and in the rest of the paper, we will let 7 denote any vector whose norm is negligible 
(i.e. smaller than l/n c for any large constant C > 1). This will simplify our calculations. 
Also let A*_ i denote the matrix obtained from deleting the zth column of A*. The following 
lemma is the main step in our analysis. 

Lemma 11 Suppose that A s is (26,2)-near to A*. Then the update step in Algorithm 2 
takes the form E[A^ +1 ] = Af — ggf where gf = piqi (A fAf — A* + ef ± 7), and Af = (Af,A*) 
and 

4 = (AO, i diag(q i j) (A s _i) T ) A* / q t 
Moreover the norm of e\ can be bounded as ||ef|| < 0(k/n). 

Note that p l q l is a scaling constant and A* ~ 1; hence from the above formula we should 
expect that gf is well-correlated with Af — Af. 

Proof: Since A s is (26, 2)-near to A*, A s is 2<5-close to A*. We can now invoke Lemma 10 
and conclude that with high probability, sgn(x*) = sgn(x). Let T x * be the event that 
sgn(x*) = sgn(x), and let I 77 * be the indicator function of this event. 

To avoid the overwhelming number of appearances of the superscripts, let B = A s 
throughout this proof. Then we can write gf = E[(y — Rx)sgn(xj)]. Using the fact that 
lj 7 „ + \j: t = 1 and that T x * happens with very high probability: 

9i = E[(y - Bx)sgn(xi)ljr x ,] + E[(y - Rx)sgn(x*)l^J 

= E[(y- Bx)aga(xi)ljr xt ] ±7 (3) 

The key is that this allows us to essentially replace sgn(x) with sgn(x*). Moreover, 
let S = supp(x*). Note that when F x * happens S is also the support of x. Recall that 
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according to the decoding rule (where we have replaced A s by B for notational simplicity) 
x = threshold^ (-E> T i/). Therefore, xs = ( B T y)s = B^y = BgA*x*. Using the fact that 
the support of x is 5 again, we have Bx = Bg BpA*x*. Plugging it into equation (3): 

9i = E[(y - Bx)sgn(xi)ljr x „] ±7 = E[(J - B s Bg)A*x* • sgn(a;*)l^] ±7 
= E [(/ - B s Bg)A*x* • sgn(sj)] - E[(J - B s B T s )A*x* • sgn(^)l^J ± 7 
= E[(J-B Sj Bj)A*s-sgn(x?)]±7 


where again we have used the fact that J- x * happens with very high probability. Now we 
rewrite the expectation above using subconditioning where we first choose the support S of 
x*, and then we choose the nonzero values x* s . 


E[(J - B s Bg)A*x* • sgn(x*)j 


E[E [(I-BsB^)A* 


E [ Pl {I - B S B T S )A*] 


sgn(x-)|S'] 


where we use the fact that E[x* • sgn(x*)|5] = pi. Let R = S — {i}. Using the fact that 
b s b t s = b,bJ + B R Bl, we can split the quantity above into two parts, 

St = Pi E[(/ - B.BJ )A* + Pl V{B n B T R }A* 

= pm(l - BiBj ) A* + Pi ^B_idiag(qi tj )B^A* ± 7 . 

where diag (qij) is a m x m diagonal matrix whose (j,j)~ th entry is equal to and B_i 
is the matrix obtained by zeroing out the ith column of B. Here we used the fact that 
Pr[i 6 5] = % and Pr [i,j € S] = q ip 

Now we set B = A s , and rearranging the terms, we have gf = P iQi ({Af, A*)A^ — A* + ef ± 7 ) 
where e| = (^A s _ i diag(qij) (AiJ 7 ^ A* / qi, which can be bounded as follows 

||ef|| < \\A s _i\\ 2 maxqtj/qi < 0(k/m)\\A s \\ 2 = 0(k/n) 

j¥=i 

where the last step used the fact that < 0(k/m), which is an assumption of our 

generative model. ■ 


We will be able to reuse much of this analysis in Lemma 28 and Lemma 33 because we have 
derived to for a general decoding matrix B. 

In Section 4 we complete the analysis of Algorithm 2 in the infinite sample setting. 
In particular, in Section 4.1, we prove that if A s is (2<5, 2)-near to A* then gf is indeed 
(a, (3, e)-correlated with Ai (Lemma 16). Finally we prove that if A s is (2d, 2)-near to A* 
then ||A S+1 — A*|| < 2||A*|| (Lemma 17). These lemmas together with Theorem 6 imply 
Theorem 14, the simplified version of Theorem 9 where the number of samples P is assumed 
to be infinite (i.e. we have access to the true expectation g s ). In Appendix D we prove the 
sample complexity bounds we need and this completes the proof of Theorem 9. 


3.1. Further Applications 

Here we apply our framework to design and analyze further variants of alternating mini¬ 
mization. 
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3.1.1. Revisiting Olshausen-Field 

In this subsection we analyze a variant of the Olshausen-Field update rule. However there 
are quadratic error terms that arise in the expressions we derive for g s and bounding them 
is more challenging. We will also need to make (slightly) stronger assumptions on the 
distributional model that for distinct h,* 2,*3 we have qi lt i 2 ,i 3 = 0(k 3 /m 3 ) where = 

Pr[ii,*2,*3 G S']. 

Theorem 12 Suppose that A 0 is (25,2)-near to A* and that g = Q(m/k). There is a 
variant of Olshausen-Field (given in Algorithm 4 in Appendix B.l) for which at each step 
s we have 

||A S - A*\\ 2 f < (1 - r) s ||A° - A*\\p + 0(mk 2 /n 2 ) 

for some 0 < t < 1/2 and for any s = 1,2, ...,T. In particular it converges to A* geometri¬ 
cally until the error in Frobenius norm is 0(y/rnk/n). 

We defer the proof of the main theorem to Appendix B.l. Currently it uses a projection 
step (using convex programming) that may not be needed but the proof requires it. 

3.1.2. Removing the Systemic Error 

In this subsection, we design and analyze a new update rule that converges geometrically 
until the column-wise error is n~ u ^\ The basic idea is to engineer a new decoding matrix 
that projects out the components along the column currently being updated. This has the 
effect of replacing a certain error term in Lemma 11 with another term that goes to zero as 
A gets closer to A* (the earlier rules we have analyzed do not have this property). 

We will use to denote the decoding matrix used when updating the zth column 

in the sth step. Then we set = A % and B^' l) = Proj^Aj for j / i. Note that B^f 1 

(i.e. B^^ with the ith column removed) is now orthogonal to Aj. We will rely on this fact 
when we bound the error. We defer the proof of the main theorem to Appendix B.2. 

Theorem 13 Suppose that A 0 is (25,2)-near to A* and that g = Q(m/k). There is an 
algorithm (given in Algorithm 5 given in Appendix B.2) for which at each step s, we have 

||Af - A*|| 2 < (1 - t) s ||A° - A*|| 2 + n~ uW 

for some 0 < r < 1/2 and for any s = 1,2, ..., T. In particular it converges to A* geometri¬ 
cally until the column-wise error is n - ^ 1 ). 

4. Analysis of the Neural Algorithm 

In Lenuna 11 we gave a new (and more useful) expression that describes the update direction 
under the assumptions of our generative model. Here we will make crucial use of Lemma 11 
in order to prove that gf is (a, (3, e)-correlated with Aj (Lemma 15). Moreover we use 
Lemma 11 again to show that ||A S+1 — A*|| < 2||A*|| (Lenuna 17). Together, these auxiliary 
lemmas imply that the column-wise error decreases in the next step and moreover the errors 
across columns are uncorrelated. 
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We assume that each iteration of Algorithm 2 takes infinite number of samples, and prove 
the corresponding simplified version of Theorem 9. The proof of this Theorem highlights 
the essential ideas of behind the proof of the Theorem 9, which can be found at Appendix D. 

Theorem 14 Suppose that A 0 is {25,2)-near to A* and that r] = Q{m./k). Then if each 
update step in Algorithm 2 uses infinite number of samples at each iteration, we have 

IIA - All 2 < (1 - r)lA 0 - Af + Oik 2 In 2 ) 

for some 0 < r < 1/2 and for any s = 1,2,..., T. In particular it converges to A* geometri¬ 
cally until the column-wise error is Ofik/n). 

The proof is deferred to the end of this section. 

4.1. Making Progress 

In Lemma 11 we showed that gf = piqi{\Af — A* + ef + 7 ) where A * = ( Ai,A*). Here we 
will prove that gf is (a, fi, e)-correlated with Af. Recall that we fixed 6 = 0*( 1/ logn). The 
main intuition is that gf is mostly equal to piqfiAf — A*) with a small error term. 

Lemma 15 If a vector gf is equal to 4a(A — A) + v where ||u|| < a\\Af — All + C> then 
gf is {a, 1/100a, ( 2 /a)-correlated with A*, more specifically, 

(gf, A - A) > «ll A - All 2 + ^HNI 2 - C 2 /«- 

In particular, gf is (a, (3, e)-correlated with A*, where a = f l{k/m), (I > H{m/k) and 
e = 0(h 3 /mn 2 ). We can now apply Theorem 6 and conclude that the column-wise error 
gets smaller in the next step: 

Corollary 16 If A s is (25,2)-near to A* and 77 < minj(p*%(l — A = 0(m/k), then 
gf = Pi<li{\Af — A + e l + 7) {H(k/m),I}(m/k),0{k 3 /mn 2 ))-correlated with A*, and 
further 

IIA +1 - All 2 < (1 - 2077)11 A - All 2 + 0{gk 2 /n 2 ) 

Proof: [Proof of Lemma 15] Throughout this proof s is fixed and so we will omit the su¬ 
perscript s to simplify notations. By the assumption, g .j already has a component that 
is pointing to the correct direction A — A*, we only need to show that the norm of the 
extra term v is small enough. First we can bound the norm of gi by triangle inequality: 
||0i|| < 114a (A — A)ll + IMI < 5a||(A-A)ll+C, therefore ||^|| 2 < 50a 2 || (A - A)ll 2 + 2 C 2 - 
Also, we can bound the inner-product between gi and A — A by (gi, A — A) > 4ck11^4* — 

All 2 - IAIIIIA - All- 
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Using these bounds, we will show (< 7 $, 4* — A*) — a\\Ai — ^4* |[ 2 — \\9i\\ 2 + ( 2 jcn > 0. 

Indeed we have 


{9ii Ai 

A*) a\\Ai 4*| 2 100 J|,|| 2 + C 2 /a 

2 - Wdi 2 + C 2 /« 

1 100a s ' 

> 

4a\\Ai - 4* 2 - HullPi - 4*|| - a|4, - 4*| 

> 

3ct |4j - 4* || 2 - (a |4j - 4*|| + C)||4i - 4*|| 

- II ( A, - A*) || 2 + 2C 2 ) + C 2 /a 

> 

a |4 i - 4* 2 — C |4j - 4* + -C 2 /a 


= 

(V“ 4j — 4* — C/2\/«) 2 > 0. 



This completes the proof of the lemma. ■ 

Proof:[Proof of Corollary 16] We use the form in Lemma 11, gf = piqi(\j,Af — A* + ef + 7 ) 
where A* = (At, A*). We can write gf = Piqi(Af - A*) + Piqi{{ 1 - A i)Af + ef + 7), so when 
applying Lemma 15 we can use 4 a = piqi = Q(k/m ) and v = Piqi{( 1 — Aj)4f + ef + 7). The 
norm of u can be bounded in two terms, the first term Piqi{ 1 — A;)4| has norm Pi.qi{ 1 — A*) 
which is smaller than p,,qi\\Af — 4*||, and the second term has norm bounded by (f = 
0(k 2 /mn). 

By Lemma 15 we know the vector g\ is (fl(fe/m), Q(m/k), 0(/c 3 /m?r 2 ))-correlated with 
A s . Then by Theorem 6 we have the last part of the corollary. ■ 

4.2. Maintaining Nearness 

Lemma 17 Suppose that 4 s is (25,2)-near to A*. Then ||4 S+1 — 4*|| < 2||4*|| in Algo¬ 
rithm 2. 

Proof: As in the proof of the previous lemma, we will make crucial use of Lemma 11. 
Substituting and rearranging terms we have: 

4* + 1 -4* = A\ - A* - ggf 

= (1 - mQi){Af - A*)+g Pi qi(l - Af)4| - r 7 p^ 4 i i diag(% J ) (A s _i) T ^JA* ±7 

Our first goal is to write this equation in a more convenient form. In particular let U 
and V be matrices such that t/ ? ; = piqt{ 1 — Af)4f and Vi = pi ^4f_jdiag(gjj) (4f_j) T ^4t. 
Then we can re-write the above equation as: 

4 S + 1 _ j\* — (yp _ A*)diag(l — gpiqi) + gU — gV ± 7 

where diag(l — gpiqi) is the rn x m diagonal matrix whose entries along the diagonal are 
1 - gPiQi- 

We will bound the spectral norm of 4 S+1 — 4 by bounding the spectral norm of each of 
the matrices of right hand side. The first two terms are straightforward to bound: 

|| (4 s - 4*)diag(l - gp^i) || < ||4 S — 4* || - (1 - 77 min< 2(1 - Vt{gk/m))\\A*\\ 
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where the last inequality uses the assumption that pi = 0 ( 1 ) and (p < 0(k/m), and the 
assumption that ||7P — A*|| < 2 ||A*||. 

From the definition of U it follows that U = A s &mg{piqi{\ — A|)), and therefore 
||i/|| < 5 max.piqi\\A s \\ = o(k/m) ■ ||A*|| 

i 

where we have used the fact that Af > 1 — 5 and 5 = o(l), and ||A S || < ||A S — A*|| + ||A*|| = 

o(P 1 l). 

What remains is to bound the third term, and let us first introduce an auxiliary matrix 
Q which we define as follows: Q %% = 0 and Qi.j = qij(Af, A*) for i ^ j. It is easy to verify 
that the following claim: 

Claim 18 The ith column of A S Q is equal to (^A s _ i diag(qij) (Af^ ^jA* 

Therefore we can write V = A s Qdiag(pj). We will bound the spectral norm of Q by 
bounding its Frobenus norm instead. Then from the definition of A, we have that: 

\\Q\\f < (maxqij] ^ ^(Af, A*) 2 = 0(k 2 /m 2 )\\A* T A S \\ F 

Moreover since A* T A S is an mxm matrix, its Frobenius norm can be at most a \Jrn factor 
larger than its spectral norm. Hence we have 

liv'd < ^maxpi^j IIAIIIQH < 0(k 2 ^/rn/m 2 )\\A s \\ 2 \\A*\\ 

< o(k/m)\\A*\\ 

where the last inequality uses the fact that k = 0(y/n/ logn) and ||A S || < 0 (||A*||). 
Therefore, putting the pieces together we have: 

\\A S+1 -A*\\ < ||(A s - A*)diag(l - i]Piqi)\\ + \\rjU\\ + ||r?H|| ±7 

< 2(1 — Q(rjk/m))\\A\\ + o(r?fe/m)||>1*|| + o(rjk/m)\\A *|| ± 7 

< 2||A*|| 

and this completes the proof of the lemma. ■ 

4.3. Proof of Theorem 14 

We prove by induction on s. Our induction hypothesis is that the theorem is true at 
each step s and A s is (2<5, 2)-near to A*. The hypothesis is trivially true for s = 0. Now 
assuming the inductive hypothesis is true. Recall that Corollary 16 of Section 4.1 says 
that if A s is (25, 2)-near to A*, which is guaranteed by the inductive hypothesis, by then 
gf is indeed (fl(k/m),fl(mfk),0(k 3 /mn 2 ))- correlated with A*. Invoking our framework of 
analysis (Theorem 6 ), we have that 

\\ A S+1 _ ||2 < (1 _ T )P! _ A*|| 2 + 0(k 2 /n 2 ) < (1 - t) s+1 ||A° - A*|| 2 + Oik 2 In 2 ) 

Therefore it also follows that A s+1 is 25-close to A*. Then we invoke Lemma 17 to prove 
A s+1 has not too large spectral norm ||A S+1 — A*|| < 2 ||A*||, which completes the induction. 
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Algorithm 3 Pairwise Initialization 
Set L = 0 

While \L\ < m choose samples u and v 
Set M UtV = 

Compute the top two singular values o \, 02 and top singular vector z of M U)V 
If fj 1 > fl(k/m) and 02 < 0*(k/m\ogm ) 

If 2: is not within distance 1 / log m of any vector in L (even after sign flip), add 

z to L 

Set A such that its columns are zGl and output A = ProjgA where B is the convex set 
defined in Definition 30 


5. Initialization 

There is a large gap between theory and practice in terms of how to initialize alternating 
minimization. The usual approach is to set A randomly or to populate its columns with 
samples These often work but we do not know how to analyse them. Here we give 
a novel method for initialization which we show succeeds with very high probability. Our 
algorithm works by pairwise reweighting. Let u = A*a and v = A*a' be two samples from 
our model whose supports are U and V respectively. The main idea is that if we reweight 
fresh samples y with a factor (y, u) (y, v) and compute 

— 1 P2 

M v , v = -'£(yW,u)(y®,v)y®(yU) T 

then the top singular vectors will correspond to columns A* where j G U 0 V. (This is 
reminiscent of ideas in recent papers on dictionary learning, but more sample efficient.) 

Throughout this section we will assume that the algorithm is given two sets of samples 
of size pi and P 2 respectively. Let p = pi + P 2 - We use the first set of samples for the pairs 
u, v that are used in reweighting and we use the second set to compute M UjV (that is, the 
same set of P 2 samples is used for each u,v throughout the execution of the algorithm). 
Our main theorem is: 

Theorem 19 Suppose that Algorithm 3 is given p\ = fl(m) andp 2 = Vt{mk) fresh samples 
and moreover (a) A* is p-incoherent with p = O* ( fc ^ n ), (6) m = 0(n) and (c) ||A*|| < 
O(y^). Then with high probability A is (6, 2 )-near to A* where 6 = 0*(l/logn). 

We will defer the proof of this theorem to Appendix C. The key idea is the following 
main lemma. We will invoke this lemma several times in order to analyze Algorithm 3 to 
verify whether or not the supports of u and v share a common element, and again to show 
that if they do we can approximately recover the corresponding column of A* from the top 
singular vector of M u>v . 
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Lemma 20 Suppose u = A*a and v = A*a' are two random samples with supports U, 
V respectively. Let /3 = A* T u and f3' = A* T v. Let y = A*x* be random sample that is 
independent of u,v, then 

M UjV := E[(u,y)(v,y)yy T ] = qiCiPiP' i A*Af + E 1 +E 2 + E 3 , (4) 

ieunv 

where qi = Pr[i E S], Ci = E[x* 4 |i / S], and the error terms are: 

Ei = Yligurv qi c i(3iPiA*A* T 

E 2 = 

Es = £ td e[rn]^ j %M A W' j Af + ^A^jAf). 

Moreover the error terms E\ + E 2 + E 3 has spectral norm bounded by 0*(k/m. log n), \/3i\ > 
L2(1) for all i E supp(a) and \(3' i \ > fl(l) for all i E supp(c/). 


Proof: [Proof of Lemma 20] We will prove this lemma in three parts. First we compute the 
expectation and show it has the desired form. Recall that y = A*x*, and so: 


= 


E 

s 


= E 

S lx 


= E 

s 


E[(tt, A* s x* s )(v, A* s x* s )A* s x* s x* s t A* s t \S] 
x s 

[ E [{/3, x*s) ((3', x* s )A* s x* s x* s T Af\S] 

L X S J 

[j2aPiPi A *Af + £ (AflAjAf + ^'A*Af + ^A*A 


ies 


* a*T 

j 


QiCiPi!3iA*Af + J2 (PiPiA*Af + PiP'A*Af + fob A* A* 
i,je[m],iAj 


iEm 


where the second-to-last line follows because the entries in x* s are independent and have 
mean zero, and the only non-zero terms come from x * 4 (whose expectation is cf) and x* 2 x*j 2 
(whose expectation is one). Equation (4) now follows by rearranging the terms in the last 
line. What remains is to bound the spectral norm of E\. E 2 and E 3 . 

Next we establish some useful properties of /3 and ff : 


Claim 21 With high probability it holds that (a) for each i we have \/3i — cq| < — k ^ — and 
(b) ||/3|| < 0 (yjmk/n). 


In particular since the difference between /% an 07 is o(l) for our setting of parameters, we 
conclude that if a* 7 ^ 0 then C — o(l) < \(3f < O(logm) and if a* = 0 then |/3j| < . 

Proof: Recall that U is the support of a. and let R = U\{i}. Then: 

Pi-ai = AfA^au -Oi = AfA* R a R 
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and since A* is incoherent we have that ||A* t A!j|| < py/k/n. Moreover the entries 
in an are independent and subgaussian random variables, and so with high probability 
\(A* t A* r , or) | < m and this implies the first part of the claim. 

For the second part, we can bound ||/3|| < ||A*|| ||^4^-|j ||a||. Since a is a k -sparse vector 
with independent and subgaussian entries, with high probability ||a;|| < 0 {yfk) in which 
case it follows that ||/3|| < 0 {y/mk/n). ■ 

Now we are ready to bound the error terms. 

Claim 22 With high probability each of the error terms E\,E 2 and E% in (4) has spectral 
norm bounded at most 0 *{k/m\ogm). 


Proof: Let S = [m]\(U n V ), then E\ = A* S D\A* S T where D\ is a diagonal matrix whose 
entries are QiCifdifJf We first bound ||-Di||. To this end, we can invoke the first part of 
Claim 21 to conclude that |/3j/^| < M k j^ s — . Also = @(k/m ) and so 


Pill <0 


p 2 k 3 log 2 m 


mn 


= O 


/ pk 2 log 2 n 

\ my/n 


= O* (k / m\ogm) 


Finally 11M,g11 < ||A|| < 0(1) (where we have used the assumption that m = O(n)), and this 
yields the desired bound on \\E[ ||. 

The second term E 2 is a sum of positive semidefinite matrices and we will make crucial 
use of this fact below: 

El = Y,qijPi&A)Af A 0{k 2 /m 2 )(Y J PiP'?)(Y. A *j A f) ^ 0(fc 2 M 2 )ll^llll/5 , |l^^ T - 

i¥=j i 3 


Here the first inequality follows by using bounds on qi j and then completing the square. The 
second inequality uses Cauchy-Schwartz. We can now invoke the second part of Claim 21 
and conclude that \\E 2 W < 0(fc 2 /m 2 )||/3||||/3'||||A*|| 2 < 0 *{k/m\ogm) (where we have used 
the assumption that m = 0 (n)). 

For the third error term E 3 , by symmetry we need only consider terms of the form 
dijfitftjA*A* T . We can collect these terms and write them as A*QA* T , where Q t .j = 0 if 
i = j and Q,.j = qipPiftj if i j. First, we bound the Frobenius norm of Q: 

iiqiIp = / E 


/0(/c 4 /m 4 )( Y 


i£m 


je[m] 


Finally 111^311 < 2\\A* || 2 ||Q|| < 0(m/n-k 2 /m 2 )\\(3\\\\(3'\\ < O* (k/m log m ), and this completes 
the proof of the claim. ■ 


The proof of the main lemma is now complete. ■ 


Conclusions 

Going beyond y/n sparsity requires new ideas as alternating minimization appears to break 
down. Mysterious properties of alternating minimization are also left to explore, such as 
why a random initialization works. Are these heuristics information theoretically optimal 
in terms of their sample complexity? Finally, can we analyse energy minimization in other 
contexts as well? 
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Appendix A. Threshold Decoding 

Here we show that a simple thresholding method recovers the support of each sample with 
high probability (over the randomness of x*). This corresponds to the fact that sparse 
recovery for incoherent dictionaries is much easier when the non-zero coefficients do not 
take on a wide range of values; in particular, one does not need iterative pursuit algorithms 
in this case. As usual let y = A*x* be a sample from the model, and let S be the support of 
x*. Moreover suppose that A* is /x-incoherent and let A be column-wise (5-close to A. Then 

Lemma 23 If ^ < r j k and k = H*(logm) and 5 = 0*(l/y^og m), then with high prob¬ 
ability (over the choice of x*) we have S = {i : \(Ai,y)\ > C/2}. Also for all i G S 
sgn{(Ai,y)) = sgn{x*). 

Consider (Aj,y) = (A*, A*)x* + Z t where Z % = Y2j^i(Ai, A})x} is a mean zero random 
variable which measures the contribution of the cross-terms. Note that |(A,;, A*} > (1 — 
<5 2 /2), so |(Aj, A})x *| is either larger than (1 — S 2 /2)C or equal to zero depending on whether 
or not i G S. Our main goal is to show that the variable Z % is much smaller than C with 
high probability, and this follows by standard concentration bounds. 

Proof: Intuitively, Z x has two source of randomness: the support S of x*, and the ran¬ 
dom values of x* conditioned on the support. We prove a stronger statement that only 
requires second source of randomness. Namely, even conditioned on the support S, with 
high probability S = {i : |(Aj,y)| > C/2}. 

We remark that Z % is a sum of independent subgaussian random variables and the 
variance of Z{ is equal to X^eS\{i}(A, A*) 2 . Next we bound each term in the sum as 

(A,:, A*) 2 < 2((A*, A*) 2 + (A* — A *, A}) 2 ) < 2^ + 2(A, - A*, A*) 2 . 

On the other hand, we know ||A^ro|| < 2 by Gershgorin’s Disk Theorem. Therefore 
the second term can be bounded as X^eS\{i} (A* - A*, A*} 2 = ||A* u . } T (Aj - A*)11 2 < 
0*(1/ logm). Using this bound, we know the variance is at most 0*(l/logm): 

2 (A, A*) 2 < 2y 2 k + 2 (A — Aj, A*) 2 < 0*(l/logm). 
i6S\{i} jes\{i} 

Hence we have that Z t is a subgaussian random variable with variance at most 0*(l/logm) 
and so we conclude that Z % < C /4 with high probability. Finally we can take a union bound 
over all indices i G [m] and this completes the proof of the lemma. ■ 

In fact, even if k is much larger than y/n, as long as the spectral norm of A* is small 
and the support of x* is random enough, the support recovery is still correct. 

Lemma 24 If k = 0(n/ logn), y/\fn < l/log 2 n and 5 < O* (1/y/logm), the support of 
x* is a uniformly k-sparse set, then with high probability (over the choice of x) we have 
S = {i : |(Aj, y)\ > C/2}. Also for all i G S sgn((Ai,y)) = sgn(x*). 

The proof of this lemma is very similar to the previous one. However, in the previous case 
we only used the randomness after conditioning on the support, but to prove this stronger 
lemma we need to use the randomness of the support. 

First we will need the following elementary claim: 
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Claim 25 ||A* T Aj|| < 0(yjm/n) and \A*JAi\ < 0*(\/y/\ogm) for all j ^ i. 

Proof: The first part follows immediately from the assumption that A* and A are column¬ 
wise close and that ||A*|| = 0(y/m/n). The second part follows because \A* 1 j Af\ < 
\A*JA*\ + \A*J(A* -A)|<0*(1A/Kg^). ■ 

Let R = 5'\{i}. Recall that conditioned on choice of S', we have var(Zj) = ]W gR (Aj, A*) 2 . 
We will bound this term with high probability over the choice of R. First we bound its 
expectation: 

Lemma 26 E R^ j&R {Ai, A*) 2 ] < O(kfn) 

Proof: By assumption R is a uniformly random subset of [m] \ {?'} of size |i?| (this is either 
k or k — 1). Then 


e£(^ - A*, A*) 2 } = -j^L\\A* T A t \\ 2 = 0(k/n ), 

j£R m 

where the last step uses Claim 25. ■ 

However bounding the expected variance of Zj is not enough; we need a bound that 
holds with high probability over the choice of the support. Intuitively, we should expect to 
get bounds on the variance that hold with high probability because each term in the sum 
above (that bounds var(Zj)) is itself at most 0*(l/logm), which easily implies Theorem 23. 

Lemma 27 Aj) 2 < 0*(l/logm) with high probability over the choice of R. 

Proof: Let aj = (A*, A;-) 2 , then aj = O* (1/logm) and moreover Ylj^i a j = 0{m/n ) using 
the same idea as in the proof of Lemma 26. Hence we can apply Chernoff bounds and 
conclude that with high probability 'ffj^ajXj = ^- gK (^4i, A*) 2 < O* (1/log?n) where Xj 
is an indicator variable for whether or not j G R- ■ 

Proof: [Proof of Lemma 24] Using Lemma 27 we have that with high probability over the 
choice of R, var(Zi) < 0*1/ logm. In particular, conditioned on the support R, Z t is the 
sum of independent subgaussian variables and so with high probability (using Theorem ??) 

\Zi\ < 0{\/varZi log n) = 0*(1). 

Also as we saw before that |(Aj, A*)xf\ > (1 — 5 2 /2)C if i E S and is zero otherwise. So we 
conclude that \(Ai,y)\ > C/2 if and only if i G S which completes the proof. ■ 

Remark: In the above lemma we only needs the support of x satisfy concentration in¬ 
equality in Lemma 27. This does not really require S to be uniformly random. 
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Appendix B. More Alternating Minimization 

Here we prove Theorem 12 and Theorem 13. Note that in Algorithm 4 and Algorithm 5, we 
use the expectation of the gradient over the samples instead of the empirical average. We 
can show that these algorithms would maintain the same guarantees if we used p = kl(mk) 
to estimate g s as we did in Algorithm 2. However these proofs would require repeating very 
similar calculations to those that we performed in Appendix D, and so we only claim that 
these algorithms maintain their guarantees if they use a polynomial number of samples to 
approximate the expectation. 


B.l. Proof of Theorem 12 

We give a variant of the Olshausen-Field update rule in Algorithm 4. Our first goal is to 
prove that each column of g s is (a, /3, e)-correlated with A*. The main step is to prove an 
analogue of Lemma 11 that holds for the new update rule. 

Lemma 28 Suppose that A s is (25,5)-near to A* Then each column of g s in Algorithm f 
takes the form 

9i = Qi ((A|) 2 A| — AjAf + ef) 

where A* = ( Ai,A*). Moreover the norm of ef can be bounded as ||ef|| < 0(k 2 /mn). 


We remark that unlike the statement of Lemma 15, here we will not explicitly state the 
functional form of ef because we will not need it. 

Proof: The proof parallels that of Lemma 11, although we will use slightly different con¬ 
ditioning arguments as needed. Again, we define J- x * as the event that sgn(x*) = sgn(cc), 
and let lr , be the indicator function of this event. We can invoke Lemma 23 and conclude 
that this event happens with high probability. Moreover let J 7 , be the event that i is in the 
set S = supp(x'*) and let ljr ; be its indicator function. 

When event T x * happens, the decoding satisfies xs = A'gA^x^ and all the other entries 
are zero. Throughout this proof s is fixed and so we will omit the superscript s for notational 
convenience. We can now rewrite g,; as 


9i = E[(y- Ax)x T ] = E[(y- Ax)xJlj? x ,] + E[(y - Ax)xf (l - 1^*)] 


= E 
= E 
= E 


(I-A T s A s )A* s x* s x* s T A* s T A i 1^.1* ± 7 
(I — AgAs)A* s x* s x* s T A* S T Ailjr t 
(I-A T s A s )A* s x* s x* s T AfA t l Fi ± 7 
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Algorithm 4 Olshausen-Field Update Rule 

Initialize A 0 that is (<5o■ 2)-near to A* 

Repeat for s = 0,1 ,T 

Decode: x = thresholdc/ 2 (( j 4 s ) T ?/) for each sample y 
Update: A s+1 = A s — rjg s where g s = E [(y — A s x)x I ) 

Project: A s+1 = Proj e A s+1 (where B is defined in Definition 30) 


Once again our strategy is to rewrite the expectation above using subconditioning where 
we first choose the support S of x*, and then we choose the nonzero values x* s . 


9i 


= E 


S Li 


E[(7 - A t s A s )A* s x* s x* s t A* s t A 1 1^\S] 


±7 


E 

E 

E 


(I - A S A T S )A* S A* S T A l It, 


±7 


(I - AiAf - A R A R T )(A*Af + A* r A* r t )A 1 Ti 


±7 


(/ - A i Aj)(A*A* T )A i ljr i 


-E 


A R A R T A*AfA t l T . 


+ E 
-E 


{I-A i J^)A* R A* R T A i l Ti 


A r A^A* r A* r a At, 


±7 


Next we will compute the expectation of each of the terms on the right hand side. 
This part of the proof will be somewhat more involved than the proof of Lemma 11, 
because the terms above are quadratic instead of linear. The leading term is equal to 
qi(XiA* — X 2 A/) and the remaining terms contribute to e*. The second term is equal to 
(I — AiAf)A*_ i diag(qi t j)A^ i T Ai which has spectral norm bounded by 0(k 2 /mn). The 
third term is equal to AjA_jdiag(gjj)AU T A* which again has spectral norm bounded by 
0(k 2 /mn). The final term is equal to 


E 


A-RA' R A* R A R r Ai It, 


J1J2A 


— I ^2 ,h(Aj 2 > Ai){Aj 2 , A nt 
hAi 

= A-iV. 


A 


3 1 


where v is a vector whose j' 2 -th component is equal to 9i,h,h{A*j 2 , A/)(A* 2 , Aj/) . The 

absolute value of Vj 2 is bounded by 

Kl < 0(k 2 /m 2 )\(A‘ 2 , A ,)I + 0(k 3 /m 3 )( £ ((A*„ A,) 2 + (A*„ A h ) 2 )) 

< 0(k 2 /m 2 )\{A* 2 , At)\ + 0(k 3 /m 3 )\\A*\\ 2 = 0(k 2 /m 2 )(\(A* 2 , Ai)\ + k/n). 

The first inequality uses bounds for g’s and the AM-GM inequality, the second inequality 
uses the spectral norm of A*. We can now bound the norm of v as follows 

IMI < 0 (k 2 /m 2 ■ \J m/n ) 
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and this implies that the last term satisfies ||A_j||||u|| < 0(k 2 /mn). Combining all these 
bounds completes the proof of the lemma. ■ 

We are now ready to prove that the update rule satisfies Definition 5. This again uses 
Lemma 15, except that we invoke Lemma 28 instead. Combining these lemmas we obtain: 

Lemma 29 Suppose that A s is (25, 5)-near to A*. Then for each i, g | as defined in 
Algorithm 4 is (a, fi,e)-correlated with A*, where a = Q(k/m), /3 > Q(m/k) and e = 
0 (k 3 /mn 2 ). 

Notice that in the third step in Algorithm 4 we project back (with respect to Frobenius 
norm of the matrices) into a convex set B which we define below. Viewed as minimizing a 
convex function with convex constraints, this projection can be computed by various convex 
optimization algorithm, e.g. subgradient method (see Theorem 3.2.3 of Section 3.2.4 of 
Nesterov’s seminal Book Nesterov (2004) for more detail). Without this modification, it 
seems that the update rule given in Algorithm 4 does not necessarily preserve nearness. 

Definition 30 Let B = {A|A is <5o close to A 0 and ||A|| < 2||A*||} 

The crucial properties of this set are summarized in the following claim: 

Claim 31 (a) A* e B and ( b) for each A e B, A is (25o, 5)-near to A* 

Proof: The first part of the claim follows because by assumption A* is <5o-close to A 0 and 
||A*—A°|| < 2||A*||. Also the second part follows because \\A— A*|| < \\A— A°|| + ||A°—A*|| < 
4||A*||. This completes the proof of the claim. ■ 

By the convexity of B and the fact that A* e B, we have that projection doesn’t increase 
the error in Frobenius norm. 

Claim 32 For any matrix A, || Proj B A — A*||,p < ||A — 

We now have the tools to analyze Algorithm 4 by fitting it into the framework of 
Corollary 7. In particular, we prove that it converges to a globally optimal solution by 
connecting it to an approximate form of projected gradient descent: 

Proof: [Proof of Theorem 12] We note that projecting into B ensures that at the start of 
each step || A s — A*|| < 5|| A* ||. Hence gf is (f l(k/m), f l(m/k), 0(k 3 /mn 2 ))- correlated with A* 
for each i , which follows from Lemma 29. This implies that g s is (Ll(k/m), Fl(m/k), 0(k 3 /n 2 ))- 
correlated with A* in Frobenius norm. Finally we can apply Corollary 7 (on the matrices 
with Frobenius) to complete the proof of the theorem. ■ 
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Algorithm 5 Unbiased Update Rule 

Initialize A 0 that is (<5o,2)-near to A* 

Repeat for s = 0,1 ,T 

Decode: x = thresholdc/ 2 ((^. s ) T y) for each sample y 

x 1 = thresholdc/ 2 ((-E^ s ’^) T y) for each sample y, and each i E [m] 
Update: Af +1 = A? — rjgf where gf = E[(y — B^ s ’^ x l )sgn(x)f] for each i E [rri] 


B.2. Proof of Theorem 13 

The proof of Theorem 13 is parallel to that of Theorem 14 and Theorem 12. As usual, our 
first step is to show that g s is correlated with A*: 

Lemma 33 Suppose that A s is (5, 5 )-near to A*. Then for each i, gf as defined in Algo¬ 
rithm 5 is (a, f3,e)-correlated with A*, where a = fi(k/m), > Tt{m,/k) and e < n “‘‘d 1 ). 


Proof: We chose to write the proof of Lemma 11 so that we can reuse the calculation here. 
In particular, instead of substituting B for A s in the calculation we can substitute B^ s,i ^ 
instead and we get: 


Recall that Xf 


9 M = PWiKAf ~A* + RiV ) dia€(ffi I i)si? )T A*) + 7- 
(Af, A*). Now we can write g^ s ^ = piqfiAf — A*) + v, where 


v = Piqfi Af - 1 )Af + piq-iB^f diag(q i j)B ( fl z)T A* + 7 

Indeed the norm of the first term PiqfiXf — 1)A| is smaller than Piqi\\Af — A* ||. 

Recall that the second term was the main contribution to the systemic error, when we 

(s i)T 

analyzed earlier update rules. However in this case we can use the fact that Bff Af = 0 
to rewrite the second term above as 


Piq i B { fi? ) d.ia,g(q i j)B { f? )T (A* - Af) 

Hence we can bound the norm of the second term by 0(k 2 /mn)\\Af — A|||, which is also 
much smaller than p^cpfiAf — Af ||. 

Combining these two bounds we have that ||u|| < piqt\\Af — A*||/4 + 7 , so we can take 
£ = 7 = n~ u ^ lS) in Lemma 15. We can complete the proof by invoking Lemma 15 which 
implies that the g^ s,i ' is (Q(k/m), Q(m/k), rU^^-correlated with Aj. ■ 


This lemma would be all we would need, if we added a third step that projects onto 
B as we did in Algorithm 4. However here we do not need to project at all, because the 
update rule maintains nearness and thus we can avoid this computationally intensive step. 


Lemma 34 Suppose that A s is (S,2)-near to A*. Then ||A S+1 — A*|| < 2||A*|| in Algo¬ 
rithm 5. 
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This proof of the above lemma parallels that of Lemma 17. We will focus on highlighting 
the differences in bounding the error term, to avoid repeating the same calculation. 

Proof: [sketch] We will use A to denote A s and £>h) to denote to simplify the notation. 

Also let A t be normalized so that Ai = Ai/\\Ai\\ and then we can write = ( I—AiAJ)A-i. 

Hence the error term is given by 

(/ - AiAf )A„ i diag(q iJ )A T i (I - AiAj)A* 

Let C be a matrix whose columns are Ci = (I — AiAJ)A* = Ai — (Ai, A*)Ai. This implies 
that IICH < 0(i/ m/n). We can now rewrite the error term above as 

A_jdiag(gij)A T i C'i - (AiAi) T A- i diag(qi ! j)A 7 ) i Ci 

It follows from the proof of Lemma 17 that the first term above has spectral norm 
bounded by 0(k/m-^Jm/n). This is because in Lemma 17 we bounded the term A_jdiag((/ij)AAA* 
and in fact it is easily verified that all we used in that proof was the fact that j|A*|| = 
0(^/m/n), which also holds for C. 

All that remains is to bound the second term. We note that its columns are scalar mul¬ 
tiples of Ai, where the coefficient can be bounded as follows: ||Aj||||A_j|| 2 ||diag(q , i J )||||A*|| < 

0(k 2 /mn). Hence we can bound the spectral norm of the second term iby 0(k 2 /mn)\\A\\ = 
0*(k/m ■ yjm/n). We can now combine these two bounds, which together with the calcu¬ 
lation in Lemma 17 completes the proof. ■ 

These two lemmas directly imply Theorem 13. 

Appendix C. Analysis of Initialization 

Here we prove an infinite sample version of Theorem 19 by repeatedly invoking Lemma 20. 

We give sample complexity bounds for it in Appendix D.3 where we complete the proof of 
Theorem 19. 

Theorem 35 Under the assumption of Theorem 19, if Algorithm 3 has access to M u>v 
(defined in Lemma 20) instead of the empirical average then with high probability A 

is (S, 2) -near to A* where 6 = 0*(l/logn). 

Our first step is to use Lemma 20 to show that when u and v share a unique dictionary 
element, there is only one large term in M u , v and the error terms are small. Hence the top 
singular vector of M UiV must be close to the corresponding dictionary element Ai. 

Lemma 36 Under the assumptions of Theorem 19, suppose u = A*a and v = A*a' are 
two random samples with supports U, V respectively. When U n V = {i} the top singular 
vector of M u v is 0*(1/ \ogn)-close to A*. 

Proof: When u and v share a unique dictionary element i, the contribution of the first 
term in (4) is just qidf3il3' i A*A* T . Moreover the coefficient qiCifdiPi is at least Ll{k/m) which 
follows from Lemma 20 and from the assumptions that c* > 1 and qi = Ul{k/m). 
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On the other hand, the error terms are bounded by \\Ei + E 2 + -E 3 II < O*(k /mlogm) 
which again by Lemma 20. We can now apply Wedin’s Theorem (see e.g. Horn and Johnson 
(1990)) to 

M uv = qi cMA*Af + {E X + E 2 + E 3 ) 

'- - -' 

perturbation 

and conclude that its top singular vector must be 0*(k/mlogm)/£l(k/m) = 0 *(l/logm)- 
close to A*, and this completes the proof of the lemma. ■ 

Using (4) again, we can verify whether or not the supports of u and v share a unique 
element. 

Lemma 37 Suppose u = A*a and v = A*a' are two random samples with supports U, 
V respectively. Under the assumption of Theorem 19, if the top singular value of M UjV is 
at least £l(k/m) and the second largest singular value is at most 0*{k/m\ogm), then with 
high probability u and v share a unique dictionary element. 

Proof: By Lemma 20 we know with high probability the error terms have spectral norm 
0*{k/m. log m). Here we show when that happens, and the top singular value is at least 
Tt{k/m), second largest singular value is at most 0*{k/m\ogm), then u and v must share 
a unique dictionary element. 

If u and v share no dictionary element, then the main part in Equation (4) is empty, 
and the error term has spectral norm O* {k/m log m). In this case the top singular value of 
M UjV cannot be as large as Tl{k/m). 

If u and v share more than one dictionary element, there are more than one terms in 
the main part of (4). Let S = U n V, we know M UtV = A* s DsA* s T + E\ + E 2 + E 3 where 
Ds is a diagonal matrix whose entries are equal to qiCi/dif3[. All diagonal entries in Dg have 
magnitude at least Q(k/m). By incoherence we know A* s have smallest singular value at 
least 1/2, therefore the second largest singular value of A* s DgA* s T is at least: 

cr 2 (A* s D s A * s T ) > <Jmin(A* s ) 2 a 2 (D s ) > Q(k/m). 

Finally by Weyl’s theorem (see e.g. Horn and Johnson (1990)) we know a 2 (M U}V ) > 
a 2 (A* s DsA* s T ) — \\E\ + E- 2 + E 3 \\ > Cl{k/m). Therefore in this case the second largest 
singular value cannot be as small as 0*{k/mlogm). 

Combining the above two cases, we know when the top two singular values satisfy the 
conditions in the lemma, and the error terms are small, u and v share a unique dictionary 
element. ■ 

Finally, we are ready to prove Theorem 19. The idea is every vector added to the list 
L will be close to one of the dictionary elements (by Lemma 37), and for every dictionary 
element the list L contains at least one close vector because we have enough random samples. 

Proof: [Proof of Theorem 35] By Lemma 37 we know every vector added into L must be 
close to one of the dictionary elements. On the other hand, for any dictionary element A*, 
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by the bounded moment condition of V we know 

Pr[| Ur V\ = {*}] = Pr[i € U] Pr[i 6 V\ Pr[(C n H)\{i} = 0|t € U, j € U] 

> Pr[i € U] Pr[i e V](i - E^ iJeH P r\j eUnV\ieU,je V]) 

= £l(k 2 /m 2 ) • (1 — m ■ 0{k 2 /m 2 )) 

= Q(k 2 /m 2 ). 

Here the inequality uses union bound. Therefore given 0{m 2 log 2 n/k 2 ) trials, with high 
probability there is a pair of u,v that intersect uniquely at i for all i E [m]. By Lemma 36 
this implies there must be at least one vector that is close to A* for all dictionary elements. 

Finally, since all the dictionary elements have distance at least 1/2 (by incoherence), the 
connected components in L correctly identifies different dictionary elements. The output A 
must be 0*(l/logm) close to A*. ■ 

Appendix D. Sample Complexity 

In the previous sections, we analyzed various update rules assuming that the algorithm 
was given the exact expectation of some matrix-valued random variable. Here we show 
that these algorithms can just as well use approximations to the expectation (computed by 
taking a small number of samples). We will focus on analyzing the sample complexity of 
Algorithm 2, but a similar analysis extends to the other update rules as well. 

D.l. Generalizing the (a, (3, e)-correlated Condition 

We first give a generalization of the framework we presented in Section 2 that handles 
random update direction g s . 

Definition 38 A random vector g s is (a, /?, e s )-correlated-whp with a desired solution z* 
if with probability at least 1 — 

(g s ,z s - z*) > a||z s - z*\\ 2 + f3\\g s \\ 2 - e s . 

This is a strong condition as it requires the random vector is well-correlated with the 
desired solution with very high probability. In some cases we can further relax the definition 
as the following: 

Definition 39 A random vector g s is (cc, /3, e s )-correlated-in-expectation with a desired 
solution z* if 

E[(<f, - z*)\ > a\\z s - z*\\ 2 + /?E[||<f || 2 ] - e s . 

We remark that E[||g s || 2 ] can be much larger than || E[g ,s ]|| 2 , and so the above notion 
is still stronger than requiring (say) that the expected vector E[g s ] is (a, /3, e s )-correlated 
with 2 *. 
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Theorem 40 Suppose random vector g s is (a, (3, e s )-correlated-whp with z* for s = 1, 2,..., T 
where T < poly(ra), and rj satisfies 0 < g < 2/3, then for any s = 1,..., T, 

E[\\z s+1 - £*|| 2 ] < (1 - 2ar])\\z s - z*|| 2 + 2 ge s 

In particular, if Ijz 0 — z*\\ < 5o and e s < a ■ o((l — 2ag) s )5Q + e, then the updates converge 
to z* geometrically with systematic error e/a in the sense that 

E[\\z s - z*|| 2 ] < (1 - 2 ar,) s 5l + e/a. 

The proof is identical to that of Theorem 6 except that we take the expectation of both 
sides. 

D.2. Proof of Theorem 9 

In order to prove Theorem 9, we proceed in two steps. First we show when A s is (5 S , 2)-near 
to A*, the approximate gradient is {a, /3, e s )-correlated-whp with optimal solution A* , with 
e s < 0(k 2 /mn)+a-o(5 2 ). This allows us to use Theorem 40 as long as we can guarantee the 
spectral norm of A s — A* is small. Next we show a version of Lemma 17 which works even 
with the random approximate gradient, hence the nearness property is preserved during the 
iterations. These two steps are formalized in the following two lemmas, and we defer the 
proofs until the end of the section. 

Lemma 41 Suppose A s is (25, 2)-near to A* and g < minfipiqfil — 5)) = 0(m/k), then Iff 
as defined in Algorithm 2 is (a, /3 , e s )-correlated-whp with A* with a = El(k/m), (3 = Q(m/k) 
and e s < a ■ o(d 2 ) + 0(k 2 /mn). 

Lemma 42 Suppose A s is ( 5 S , 2)-near to A* with 5 S = 0*(1/ logn), and number of samples 
used in step s is p = El(mk), then with high probability ^4 S+1 satisfies ||7P +1 — j4*|| < 2||^4*||. 

We will prove these lemmas by bounding the difference between fjf and gf using various 
concentration inequalities. For example, we will use the fact that g/ is close to gf in 
Euclidean distance. 

Lemma 43 Suppose A s is ( 5 S , 2)-near to A* with 5 S = 0*(1/ logn), and number of samples 
used in step s is p = Q(mk), then with high probability || 'gf — gf || < 0(k/m) ■ (o(<5 s ) + 
0(y/kjn)). 

Using the above lemma, Lemma 41 now follows the fact that gf is correlated with A*. 
The proof of Lemma 42 mainly involves using matrix Bernstein’s inequality to bound the 
fluctuation of the spectral norm of 7l s+1 . 

Proof: [Proof of Theorem 9] The theorem now follows immediately by combining Lemma 41 
and Lemma 42, and then applying Theorem 40. ■ 
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D.3. Sample Complexity for Algorithm 3 

For the initialization procedure, when computing the reweighted covariance matrix M u v we 
can only take the empirical average over samples. Here we show with only H(mk) samples, 
the difference between the true M UjV matrix and the estimated M UjV matrix is already small 
enough. 

Lemma 44 In Algorithm 3, if p = H(mk) then with high probability for any pair u,v 
consider by Algorithm 3, we have ||M U> „ — M U)V \\ < O*(k /mlogn). 

The proof of this Lemma is deferred to Section D.4.3. notice that although in Algo¬ 
rithm 3, we need to estimate M u v for many pairs u and v, the samples used for different 
pairs do not need to be independent. Therefore we can partition the data into two parts, 
use the first part to sample pairs u,v, and use the second part to estimate M UjV . In this 
way, we know that for each pair u,v the whole initialization algorithm also takes f l(mk) 
samples. Now we are ready to prove Theorem 19. 

Proof: [Proof of Theorem 19] First of all, the conclusion of Lemma 36 is still true for M u v 
when p = Cl(mk). To see this, we could simply write 

M U}V = qidPifcA'Af + (E 1 + E 2 + E 3 ) + (M u>v - M u>v ) 

-- - -' 

perturbation 

where E\,E 2 , E 3 are the same as the proof of Lemma 36. We can now view M UjV — M u v as 
an additional perturbation term with the same magnitude. We have that when UHV = {i} 
the top singular vector of M UiV is 0*(l/logn)-close to A*. Similarly, we can prove the 
conclusion of Lemma 37 is also true for M Uj „. Note that we actually choose p such that the 
perturbation of matches noise level in Lemma 37. Finally, the proof of the theorem 
follows exactly that of the infinite sample case given in Theorem 35, except that we invoke 
the finite sample counterparts of Lemma 20 and Lemma 37 that we gave above. ■ 


D.4. Proofs of Auxiliary Lemmas 

Here we prove Lemma 41, Lemma 42, and Lemma 44 which will follow from various versions 
of the Bernstein inequality. We first recall Bernstein’s inequality that we are going to use 
several times in this section. Let Z be a random variable (which could be a vector or a 
matrix) chosen from some distribution V and let Z^\ Z^ 2 \ ..., Z^ be p independent and 
identically distributed samples from D. Bernstein’s inequality implies that if E [Z] = 0 and 
for each j, ||Z^|| < R almost surely and E[(Z^)) 2 ] < cr 2 , then 


E zli) 



(5) 


with high probability. The proofs below will involve computing good bounds on R and cr 2 . 
However in our setting, the random variables will not be bounded almost surely. We will 
use the following technical lemma to handle this issue. 
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Lemma 45 Suppose that the distribution of Z satisfies Pr[||Z|| > i?(log(l//?)) c ] < 1 — p 
for some constant C > 0, then 

(a) If p = nthen ||.Z^|| < 0{R) holds for each j with high probability and 

(b) l|E[Zl ||z||2a(H) ]||=n-“W. 

In particular, if ^ YTj=i Z ^(X — ^■||z(j)||>Q(fl)) concen t ra t e d with high probability, then 
p Ej=i zU) is too. 

Proof: The first part of the lemma follows from choosing p = n _logn and applying a union 
bound. The second part of the lemma follows from 

E[-^l||z||>Kio g 2c J < E[||Z||l|| Z ||> R i og 2 C J 

r oo 

= i?log 2 c nPr[||Z|| > ATog 2 c n] + / Pr[||Z|| > t]dt = n - ^ 1 ). 

J R log 2c n 

and this completes the proof. ■ 

All of the random variables we consider are themselves products of subgaussian random 
variables, so they satisfy the tail bounds in the above lemma. In the remaining proofs we 
will focus on bounding the norm of these variables with high probability. 

D.4.1. Proof of Lemma 43 and Lemma 41 

Since s is fixed throughout, we will use A to denote A s . Also we fix i in this proof. Let 
S denote the support of x* . Note that fji is a sum of random variable of the form ( y — 
Ar)sgn(xj). Therefore we are going to apply Bernstein inequality for proving fij concentrates 
around its mean gi. Since Bernstein is typically not tight for sparse random varaibles like in 
our case. We study the concentration of the random variable Z := (y — Ax)sgn(xj) | * e S 
first. We prove the following technical lemma at the end of this section. 

Claim 46 Let Z^\ ..., Z^ be i.i.d random variables with the same distribution as Z := 
(y — Ax)sgn(xi ) | i G S. Then when t = Ll(k 2 ), 

1 £ 

||_2>0)_E[Z]|| < o(Ss) + 0{y/kM 
3 =1 

We begin by proving Lemma 43. 

Proof: [Proof of Lemma 43] Let W = {j : i £ supp(x*^^)} and then we have that 

9i = ^ ^(l/^ ~ ^ 0) )sgn(x l 0) ) 

Note that ^ —Ax^) sgn(x^) has the same distribution as | Ej=i f° r ^ = | W|, 

and indeed by concentration we have I = |W| = Ii(k 2 ) when p = Ll(mk). Also note that 
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E[(y — Ax)sgn(xj)] = qi • E [Z] with (/,; = 0(k/m). Therefore by Lemma 46 we have that 

i ^ 

]| 9i ~ 9i || < 0(k/m) • ||- ^ Z ^ j) ~ E[Z]|| < 0(k/m) • ( o(5 s ) + 0(y/kjn)) 

3 =1 

and this completes the proof. ■ 

Proof: [P roof of Lemma 41] Therefore using Lemma 11 we can write Tjf (whp) as gi = 
9i~ 9i + 9i = 4a (Af - A*) + v with ||u|| < a\\Af - A?|| + O(kfm) • ( o(S s ) + 0(y/kjn)). By 
Lemma 15 we have gi is (£l(k/m),tt(m/k),o(k/m ■ 5 2 ) + 0(k 2 /mn))- correlated-whp with 

A*. ■ 

Then it suffices to prove Claim 46. To this end, we apply the Bernstein’s inequality 
stated in equation 5 with the additional technical lemma 45. We are going to control the 
maximum norm of Z and as well as the variance of Z using Claim 47 and Claim 48 as 
follows: 

Claim 47 ||Z|| = \\{y — Ax)sgn{xi)\\ < 0{gk/y/n + k5 2 + V~kd s ) holds with high probability 

Proof: We write y — Ax = (Ag — AsAigA* s )x* s = (Alj — As)®)) + Ag(I — A^A* s )x* s and we 
will bound each term. For the first term, since A is <5 s -close to A* and |5| < 0(k ), we have 
that ||Ag — As||f < O(dsVk)- And for the second term, we have 

||a s (4aj-/)|| f <||/i s ||||(4aj-/)|| f 

< (Ills'll + <isVk)(\\(As — A* s ) t A* s \\ f + || A* S T A* s - I ||f) 

< (2 + <5 s v^)(||A£|| IIA s - A£|| F + gk/yfr) < 0{gk/yfh + 5 2 s k + Vk5 s ). 

Here we have repeatedly used the bound ||[/E||f < j|C/|||| H|| F and the fact that A* is g 
incoherent which implies ||A^|| < 2 . Recall that the entries in x* s are chosen independently 
of S and are subgaussian. Hence if M is fixed then ||Afxc|| < 0 (||M||f) holds with high 
probability. And so 

||(y - Ax)sgn(xj)|| < 0(||AJ, - A s ||f + \\A S (A^A* S - J)|| F ) < d(gk/y/n +kS 2 + Vkd s ) 
which holds with high probability and this completes the proof. ■ 

Next we bound the variance. 

Claim 48 E[||Z'|| 2 ] = E[||(y — Ax)sgn(xi)\\ 2 \i € S\ < 0(k 2 5 2 ) + 0(k 3 /n ) 

Proof: We can again use the fact that y — Ax = (AJ — AsAgAg)xg and that x$ is 
conditionally independent of S with E[P!j(adj) T ] = I and conclude 

E[||(y — Ax)sgn(aii)|| 2 |i eS]= E[||Ag — AsAgA* s \\ 2 F \ i e S] 

Then again we write A* s — AsA^AIj as (A* s — As) + Ag(Ikxk — A^A* S ), and the bound 
the Frobenius norm of the two terms separately. First, since A is <5 s -close to A*, we have 
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that Ag — Ag has column-wise norm at most 5 S and therefore ||A1j — As||i? < Vkd s . Second, 
note that 11||< O(Vk) since each column of A has norm 1 ± 6 S , we have that 


ElllAsOf — AsAsOIIf I * e s] ^ o(k) e[||(-Z fcxfc ~ As As) I If A e A 


< 0(k ) E 


J2(1~aJa*) 2 + J2 (Aj,Ai) 2 \ieS 


j&s 


j+idS 


We can now use the fact that A is <5 s -close to A *, expand out the expectation, and use 
the fact that Pr[j 6 S, £ G S' | i G S'] ^ 0(k 2 /m 2 ), to obtain 

E[||4s(/ — AsAs) \\p | * G S] 

< 0(k 2 5 2 ) + 0(k 3 /m 2 )- Y (A^A^ + O^/m^AjAUf + Oie/m^A^An 2 

j/G[m]\i 

< 0{k 2 d 2 ) + 0{k 3 /n) 

and this completes the proof. ■ 

Proof:[Proof of Claim 46] We apply first Bernstein’s inequality (5) with R = 0(y,k/y/n + 
kd 2 + VkS s ) and a 2 = 0{k 2 5 2 ) + 0{k 3 /n ) on random variable Z^\ 1 — Iii^Cj)||> r 2 ( J R))- Then 
by claim 47, claim 48 and Bernstein Inequality, we know that the truncated version of Z 
concentrates when i = Q(k 2 ), 

1 t 

- Y Z [j) { 1 - l||z 00 ||>n(fl)) ~ E[^(l - l||z||>Q(i?))] 
j =i 

Note that we choose £ = k 2 log c n for a large constant c so that it kills the log factors caused 
by Bernstein’s inequality. Then by Lemma 45, we have that ^ • Z^'Z also concentrates: 

1 ^ 

|| - £ Z«> - E[Z] || < o(i) + O(v^) 
j=i 

and this completes the proof. ■ 

D.4.2. Proof of Lemma 42 

Proof: [Proof of Lemma 42] We will apply the matrix Bernstein inequality. In order to do 
this, we need to establish bounds on the spectral norm and on the variance. For the spectral 
norm bound, we have ||(y — 7Px)sgn(:r) T || = ||(y — A s x)||||sgn(x)|| = Vk\\(y — A s :r)||. We 
can now use Claim 47 to conclude that ||(y — A s :r)|| < 0(k), and hence ||(y — vl s x)sgn(x)|| < 
0 (fc 3 / 2 ) holds with high probability. 

For the variance, we need to bound both E[(y — yl s x)sgn(x) T sgn(x)(y — A s x) T } and 
E[sgn(x)(y — A s x) T (y — A s x)sgn(x) T ]- The first term is equal to &E[(y — A s x)(y — A s x) T }. 
Again, the bound follows from the calculation in Lemma 28 and we conclude that 

|| E[(y — A s x)sgn(x) T sg n (x)(y — A s x) 7 ]|| < 0(k 2 /n) 


<Ol ? 1+0 



) = o{5 s )+0{y/kJn) 
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To bound the second term we note that 

E[sgn(x)(y — A s x) T (y — A s x)sgn(x) 7 ] A 0(k 2 ) E[sgn(x)sgn(x) i ] A 0(k 3 /m)I 

Moreover we can now apply the matrix Bernstein inequality and conclude that when the 
number of samples is at least Q(mk) we have 

I p _ 

|| - ^(y^ — A s x*^)sgn(x*^) T — E[(y — A s x)sgn(x) r || < 0*(k/m ■ y 7 m/n) 

^ 3 = 1 

and this completes the proof. ■ 

D.4.3. Proof of Lemma 44 

Again in order to apply the matrix Bernstein inequality we need to bound the spectral norm 
and the variance of each term of the form (u,y)(v,y)yy T . We make use of the following 
claim to bound the magnitude of the inner product: 

Claim 49 \(u,y)\ < 0(Vk ) and ||y|| < 0(\fk) hold with high probability 

Proof: Since u = A* a and because a is fc-sparse and has subgaussian non-zero entries we 
have that ||u|| < O(Vk), and the same bound holds for y too. Next we write |(u,y}| = 
|(A*gU, adj) | where S is the support of x*. Moreover for any set S, we have that 

Ms r «ll £ MslllMI < o(Vk) 

holds with high probability, again because the entries of x* s are subgaussian we conclude 
that \(u,y)\ < 0(y/k) with high probability. ■ 

This implies that || {u, y){v, y)yy 1 1| < 0(k 2 ) with high probability. 

Now we need to bound the variance: 

Claim 50 || E[(u, y) 2 (v, y) 2 yy T yy T ] II < 0(k 3 /m) 

Proof: We have that with high probability ||y|| 2 < 0(k ) and (u,y) 2 < O(k), and we can 
apply these bounds to obtain 

E [{u,y) 2 {v,y) 2 yy T yy T ] A 0(k 2 ) E[(u, y) 2 yy T ] 

On the other hand, notice that E [{v,y) 2 yy T ] = M v v and using Lemma 20 we have that 
II E[(u, y) 2 yy T ] || < 0(k/m). Hence we conclude that the variance term is bounded by 
0(k 3 /m). ■ 

Now we can apply the matrix Bernstein inequality and conclude that when the number 
of samples is p = Q(mk) then 

\\M U)V - M U)V || < d(k 2 )/p+ 0(k 3 /m,p) < 0*(k/m\ogn) 

with high probability, and this completes the proof. 
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Figure 1: A neural implementation of Algorithm 2, which mimics that of Olshausen-Field (Figure 5 
in Olshausen and Field (1997a)) 

Appendix E. Neural Implementation 

Neural Implementation of Alternating Minimization: Here we sketch a neural ar¬ 
chitecture implementing Algorithm 2, essentially mimicking Olshausen-Field (Figure 5 in 
Olshausen and Field (1997a)), except our decoding rule is much simpler and takes a single 
neuronal step. 

(a) The bottom layer of neurons take input y and at the top layer neurons output 
the decoding x of y with respect to the current code. The middle layer labeled r is used 
for intermediate computation. The code A is stored as the weights between the top and 
middle layer on the synapses. Moreover these weights are set via a Hebbian rule, and upon 
receiving a new sample y, we update the weight on the synapses by the product of the 
values of the two endpoint neurons, Xj and r t . 

(b) The top layer neurons are equipped with a threshold function. The middle layer 
ones are equipped with simple linear functions (no threshold). The bottom layer can only 
be changed by stimulation from outside the system. 

(c) We remark that the updates require some attention to timing, which can be ac¬ 
complished via spike timing. In particular, when a new image is presented, the value of 
all neurons are updated to a (nonlinear) function of the weighted sum of the values of its 
neighbors with weights on the corresponding synapses. The execution of the network is 
shown at the right hand side of the figure. Upon receiving a new sample at time t = 0, the 
values of bottom layer are set to be y and all the other layers are reset to zero. At time 
t = 1, the values in the middle layer are updated by the weighted sum of their neighbors, 
which is just y. Then at time t = 2, the top layer obtains the decoding x of y by calculating 
threshold^ /2 (A T y) . At time t = 3 the middle layer calculates the residual error y — Ax 
and then at time t = 4 the synapse weights that store A are updated via Hebbian rule 
(update proportional to the product of the endpoint values). Repeating this process with 
many images indeed implements Algorithm 2 and succeeds provided that the network is 
appropriately initialized to A 0 that is (5,2) near to A*. 
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Neural Implementation of Initialization: Here we sketch a neural implementation of 
Algorithm 19. This algorithm uses simple operations that can be implemented in neurons 
and composed, however unlike the above implementation we do not know of a two layer 
network, but note that this procedure need only be performed once so it need not have a 
particularly fast or short neural implementation. 

(a) It is standard to compute the inner products (y, u) and ( y , v) using neurons, and 
even the top singular vector can be computed using the classic Oja’s Rule Oja (1982) in an 
online manner where each sample y is received sequentially. There are also generalizations to 
computing other principle components Oja (1992). However, we only need the top singular 
vector and the top two singular values. 

(b) Also, the greedy clustering algorithm which preserves a single estimate z in each 
equivalence class of vectors that are 0*(l/logm)-close (after sign flips) can be implemented 
using inner products. Finally, projecting the estimate A onto the set B may not be required 
in real life (or even for correctness proof), but even if it is it can be accomplished via 
stochastic gradient descent where the gradient again makes use of the top singular vector 
of the matrix. 
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